The Importance of Visualization
Michael Luu, MPH | Marie Lauzon, MS
Biostatistics & Bioinformatics Research Center | Cedars Sinai Medical Center
September 12, 2023
| x | y |
|---|---|
| 55.4 | 97.2 |
| 51.5 | 96.0 |
| 46.2 | 94.5 |
| 42.8 | 91.4 |
| 40.8 | 88.3 |
| 38.7 | 84.9 |
| 35.6 | 79.9 |
| 33.1 | 77.6 |
| 29.0 | 74.5 |
| 26.2 | 71.4 |
| x | y |
|---|---|
| 58.2 | 91.9 |
| 58.2 | 92.2 |
| 58.7 | 90.3 |
| 57.3 | 89.9 |
| 58.1 | 92.0 |
| 57.5 | 88.1 |
| 28.1 | 63.5 |
| 28.1 | 63.6 |
| 28.1 | 63.1 |
| 27.6 | 62.8 |
| x | y |
|---|---|
| 38.3 | 92.5 |
| 35.8 | 94.1 |
| 32.8 | 88.5 |
| 33.7 | 88.6 |
| 37.2 | 83.7 |
| 36.0 | 82.0 |
| 39.2 | 79.3 |
| 39.8 | 82.3 |
| 35.2 | 84.2 |
| 40.6 | 78.5 |
| x | y |
|---|---|
| 56.0 | 79.3 |
| 50.0 | 79.0 |
| 51.3 | 82.4 |
| 51.2 | 79.2 |
| 44.4 | 78.2 |
| 45.0 | 77.9 |
| 48.6 | 78.8 |
| 42.1 | 76.9 |
| 41.0 | 76.4 |
| 34.6 | 72.7 |
| dataset | n | mean_x | sd_x | mean_y | sd_y |
|---|---|---|---|---|---|
| A | 142 | 54.3 | 16.8 | 47.8 | 26.9 |
| B | 142 | 54.3 | 16.8 | 47.8 | 26.9 |
| C | 142 | 54.3 | 16.8 | 47.8 | 26.9 |
| D | 142 | 54.3 | 16.8 | 47.8 | 26.9 |
The counts (n), the means of x and y, and the standard deviations of x and y are identical for ALL four datasets!
The original “Datasaurus” (or “dino”) was created by Alberto Cairo and introduced in a blog post [1]
It was later made famous by Justin Matejka and George Fitzmaurice's paper, ‘Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing’ [2], in which they simulated 12 additional datasets whose simple summary statistics are nearly identical to those of the original “Datasaurus”
The Datasaurus Dozen is a modern take on the classic “Anscombe’s Quartet” [1]
The quartet comprises four datasets that have nearly identical simple summary measures, yet have very different distributions and look vastly different when plotted
| dataset | n | mean_x | sd_x | mean_y | sd_y |
|---|---|---|---|---|---|
| I | 11 | 9.00 | 3.32 | 7.50 | 2.03 |
| II | 11 | 9.00 | 3.32 | 7.50 | 2.03 |
| III | 11 | 9.00 | 3.32 | 7.50 | 2.03 |
| IV | 11 | 9.00 | 3.32 | 7.50 | 2.03 |
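The summary table above can be reproduced with a few lines of code. The sketch below assumes Python and its standard `statistics` module (the slides do not say which software was used) and hard-codes the published values of Anscombe's quartet; the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return n plus the mean and sample standard deviation of x and y."""
    return {
        "n": len(x),
        "mean_x": mean(x),
        "sd_x": stdev(x),
        "mean_y": mean(y),
        "sd_y": stdev(y),
    }


summary = {name: summarize(x, y) for name, (x, y) in quartet.items()}

# Print one row per dataset, rounded to two decimals like the table.
for name, row in summary.items():
    print(name, row["n"], *(f"{row[k]:.2f}" for k in ("mean_x", "sd_x", "mean_y", "sd_y")))
```

The four printed rows agree to two decimal places, even though the raw y values are clearly different.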
Useful for small to moderately sized data
A dot plot lets us visualize the spread and distribution of a single continuous or discrete variable
The x axis is the variable of interest, and each dot represents a single observation
Easy to identify the mode
Highlights clusters, gaps, and outliers
Intuitive and easy to understand
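To make the mechanics of the points above concrete, here is a minimal text-based sketch, assuming Python and integer-valued data. The function `text_dot_plot` and the `scores` values are hypothetical and for illustration only; a plotting library would normally draw this.

```python
from collections import Counter


def text_dot_plot(values):
    """Render integer data as text: one column per value, one 'o' per observation."""
    counts = Counter(values)
    # Walk the full integer range so that gaps in the data stay visible.
    xs = range(min(values), max(values) + 1)
    width = max(len(str(x)) for x in xs)
    rows = []
    for level in range(max(counts.values()), 0, -1):
        cells = [("o" if counts[x] >= level else " ").center(width) for x in xs]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(str(x).center(width) for x in xs))  # x axis labels
    return "\n".join(rows)


# Hypothetical example data: a cluster around 3, a gap at 5-6, an outlier at 7.
scores = [1, 2, 2, 3, 3, 3, 3, 4, 4, 7]
print(text_dot_plot(scores))

# The mode is simply the tallest stack of dots.
mode = Counter(scores).most_common(1)[0][0]
print("mode:", mode)
```

The tallest column marks the mode, the empty columns show the gap, and the isolated dot on the right is the outlier.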
Useful for data of any size (small or large)
A histogram lets us visualize the spread and distribution of a continuous variable
Each bar represents a ‘bin’, a defined interval of values
Although not as common, the bins do NOT have to be of equal width!
The y axis, or the height of each bar, represents the number of values that fall into that bin
The y axis is also commonly normalized to ‘relative’ frequencies or densities to show the proportion of cases that falls into each bin
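The three y-axis scales described above (count, relative frequency, density) can be sketched as follows, assuming Python. The function `histogram`, the `ages` values, and the bin edges are hypothetical; the bins are deliberately of unequal width.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Bin values into [edges[i], edges[i+1]); the last bin also includes its right edge.

    Returns counts, relative frequencies, and densities (relative frequency / bin width).
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the binned range
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    n = sum(counts)
    relative = [c / n for c in counts]
    density = [r / (hi - lo) for r, lo, hi in zip(relative, edges, edges[1:])]
    return counts, relative, density


# Hypothetical data and unequal bin widths (10, 10, 20, 40).
ages = [3, 7, 12, 15, 18, 22, 25, 31, 38, 45, 52, 67]
edges = [0, 10, 20, 40, 80]
counts, relative, density = histogram(ages, edges)

for lo, hi, c, r, d in zip(edges, edges[1:], counts, relative, density):
    print(f"[{lo:2d}, {hi:2d})  count={c}  relative={r:.3f}  density={d:.4f}")
```

With unequal bins the scales can tell different stories: the 20 to 40 bin holds the most observations, but the 10 to 20 bin has the highest density because it is half as wide. On the density scale the bar areas sum to 1.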
“A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.” [1]
A scatter plot is used to visualize the relationship between two continuous variables
Useful for detecting patterns that are hidden by quantitative summaries, as we observed in Anscombe’s quartet and the Datasaurus Dozen
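As a rough illustration of a pattern that a quantitative summary hides, the sketch below (assuming Python; the helper names are hypothetical) takes datasets I and II of Anscombe's quartet, computes the Pearson correlation and least-squares line for each, and then lists the residuals in order of x.

```python
from math import sqrt
from statistics import mean

# Datasets I and II of Anscombe's quartet share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]  # dataset I
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]   # dataset II


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares line; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def residuals_by_x(x, y):
    """Residuals from the fitted line, ordered by increasing x."""
    intercept, slope = fit_line(x, y)
    return [round(b - (intercept + slope * a), 2) for a, b in sorted(zip(x, y))]


r_linear, r_curved = pearson_r(x, y_linear), pearson_r(x, y_curved)
res_linear, res_curved = residuals_by_x(x, y_linear), residuals_by_x(x, y_curved)

print(f"r (I) = {r_linear:.3f}, r (II) = {r_curved:.3f}")
print("residuals I: ", res_linear)
print("residuals II:", res_curved)
```

The two correlations are nearly identical, but the residuals of dataset II are negative at both ends and positive in the middle. That arch is the curved relationship a scatter plot shows at a glance.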
Bar plots are useful for visualizing categorical data
Commonly used to present the count or proportion of each level
Allows us to quickly compare the magnitude of the levels based on the height of each bar
Although prevalent in the literature, bar plots should NOT be used to describe the mean and dispersion of continuous data
Only shows one arm of the error bar, making overlap comparisons difficult
Promotes the misconception that the mean is related to the height of the bar rather than the position of its top
Obscures the distribution and spread of the data
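The last point can be checked with a small sketch, assuming Python. Both samples below are hypothetical and were constructed to share the same mean and standard deviation; `bar_summary` is an illustrative helper that returns the only two numbers a bar with a one-sided error bar encodes.

```python
from collections import Counter
from statistics import mean, stdev

# Two hypothetical samples constructed to share the same mean and standard deviation.
bimodal = [1, 1, 1, 1, 9, 9, 9, 9]    # two clusters, nothing near the mean
spiked = [-3, 5, 5, 5, 5, 5, 5, 13]   # most values at the mean, two extreme points


def bar_summary(values):
    """Bar height (mean) and top of the upper error bar (mean + SD)."""
    m, s = mean(values), stdev(values)
    return round(m, 3), round(m + s, 3)


print("bimodal:", bar_summary(bimodal), dict(Counter(bimodal)))
print("spiked: ", bar_summary(spiked), dict(Counter(spiked)))
```

Both samples would be drawn as exactly the same bar and error bar, although one is bimodal and the other is a spike with two extreme values.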
Violin plots combine a box plot with an overlay of the estimated density (a smoothed histogram) of the data
More informative than a simple box plot
Visualizes the full distribution of the data
Especially useful for bimodal or multimodal distributions
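The outline of a violin is a density estimate. The sketch below, assuming Python, uses a Gaussian kernel density estimate with a hand-picked bandwidth of 0.4. The function `gaussian_kde`, the bandwidth, and the `sample` values are assumptions for illustration; plotting libraries choose the bandwidth automatically and may use other defaults.

```python
from math import exp, pi, sqrt


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    norm = 1 / (len(data) * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


# Hypothetical bimodal sample: one cluster near 1, another near 5.
sample = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
density = gaussian_kde(sample, bandwidth=0.4)

step = 0.1
grid = [i * step for i in range(-20, 81)]  # from -2.0 to 8.0
curve = [density(x) for x in grid]

# Crude sideways rendering of one half of the violin.
for x, d in zip(grid[::5], curve[::5]):
    print(f"{x:5.1f} " + "#" * round(d * 60))
```

The two bulges around 1 and 5 are what make the violin informative here. A box plot of the same sample would place its median in the empty region between the clusters.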